add markdown formatter / exporter #1976

mayel · 2024-12-08T10:40:36Z

This is a quick proof of concept of https://elixirforum.com/t/generate-docs-markdown-similar-to-epub/67946

lib/ex_doc/cli.ex

lib/ex_doc/formatter/markdown.ex

josevalim · 2024-12-09T14:13:35Z

Thank you @mayel, I think the general direction is good and I think we can continue exploring it. The testing structure will be very important too.

Just a heads up, we will be slow with reviews on our side, since we are focused on Elixir v1.18 and launching Livebook Teams.

mjrusso · 2024-12-26T14:35:24Z

This is awesome @mayel! And thanks for letting me know about this PR :)

Having spent some time thinking about this, here are a few suggested requirements for discussion/debate. My ideal Markdown export would generate:

- an exact mirror of existing HTML page structure, but in Markdown format (with .md extension for each file, and working hyperlinks between all .md files)
- at least one "single file" export with all documentation included in a single Markdown file
  - (there might be opportunities to generate a few different types of "single file" exports; here's what hex2txt is currently doing, but it would probably make sense for exdoc to generate a version of the single-file export with all "extras" included)
- a top-level Markdown file that simply links to every generated Markdown file (like a sitemap, i.e. the intention behind the llms.txt proposal)

I would also recommend generating all of this by default, so tooling can start to rely on these files existing :)

mayel · 2024-12-26T15:09:38Z

@mjrusso Thanks for the feedback!

The structure of the files and markdown contents should already match that of the html docs.

In terms of a single file, that was the intention of the ZIP archive containing all the md docs, so it can easily be downloaded from hexdocs and devs can choose which files/modules they want to add as context rather than always including everything, but now I'm thinking we could add a cli flag to generate either a single file or a ZIP with separate files, and leave the discussion of what the default should be for later?...

I've pushed some WIP I hadn't staged which includes generating an index.md with structured links to all the .md docs.
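As an illustration of the index.md idea, here is a minimal sketch of rendering a sitemap-style index with links to each generated .md file. It is not the code from this PR: the `IndexSketch` module, its `render/2` function, and the `{title, pages}` input shape are all hypothetical, and it assumes each page is written to `<page>.md` next to the index.

```elixir
defmodule IndexSketch do
  @moduledoc false

  # `groups` is a hypothetical list of {section_title, [page_name]} tuples.
  # Each page is assumed to live at "<page_name>.md" beside index.md.
  def render(project, groups) do
    sections =
      Enum.map_join(groups, "\n", fn {title, pages} ->
        links = Enum.map_join(pages, "\n", fn page -> "- [#{page}](#{page}.md)" end)
        "## #{title}\n\n#{links}\n"
      end)

    "# #{project}\n\n" <> sections
  end
end

index =
  IndexSketch.render("MyApp", [
    {"Modules", ["MyApp", "MyApp.Worker"]},
    {"Guides", ["readme"]}
  ])

IO.puts(index)
```

Grouping by section keeps the index readable for people while still giving tools one flat list of relative links to follow.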

mjrusso · 2024-12-27T13:50:58Z

The structure of the files and markdown contents should already match that of the html docs.

Perfect :) I was mostly trying to enumerate my ideal requirements independent of what was already written, just for ease of debate.

add a cli flag to generate either a single file or a ZIP with separate files, and leave the discussion of what the default should be for later?...

In general my preference would be to make Markdown generation (in whatever form we decide) the default, with no other configuration options (other than disabling it if you really don't want it) so tools can rely on a common approach.

On the topic of the zip, single file generation, etc.:

In terms of a single file, that was the intention of the ZIP archive containing all the md docs, so it can easily be downloaded from hexdocs and devs can choose which files/modules they want to add as context rather than always including everything, but now I'm thinking we could add a cli flag to generate either a single file or a ZIP with separate files, and leave the discussion of what the default should be for later?...

Since we can already download a tarball of all docs from hex.pm (the md files, if generated, would be included by default there as well, correct?), I think we can forego the zip archive. Easy enough to get the markdown files from there. (And would these be fetched by default with mix hex.docs?)

Thinking through this a bit more, I think we could forego the single-file generation (at least for now). Realistically for AI tooling integration that works we are going to need a server in between that can manage pulling the right chunks of documentation for any given task. The individual md files being produced here provide the right building blocks.

(Also, instead of "Download Markdown version", perhaps "View Markdown version", which just links to the index.md file.)

mayel · 2024-12-27T14:03:10Z

Ah the downloadable docs from hex.pm had completely passed me by! It may be useful to also include that link in the doc footers next to the ePub.

And yeah all makes sense to me, hoping I find some time to work on it a bit more soon :)

mayel · 2024-12-27T14:04:40Z

we can already download a tarball of all docs from hex.pm (the md files, if generated, would be included by default there as well, correct?)

The epub is included in that ZIP so I'm guessing yes

mayel · 2024-12-27T15:48:11Z

Alright I've experimented with updating the footer so it includes:

- Hex Package (if a hex package is set)
- View Code:
  - Source Repo (if a source url is known)
  - Hex Preview (if a hex package is set)
- View Markdown version (if markdown formatter enabled)
- Download docs archive (from hex, should include html, markdown and epub)
- Search HexDocs

Note: source repo, hex preview, and view markdown version all link to the file matching the current module/page.

mayel · 2024-12-27T16:58:18Z

OK I'm starting to feel pretty good with the generated output (tested with a bunch of projects), probably missing some things but could use some feedback on the implementation and test coverage :)

josevalim · 2025-01-02T10:59:55Z

lib/ex_doc/doc_ast.ex

+  @doc """
+  Transform AST into a markdown string.
+  """
+  def to_markdown_string(ast, fun \\ fn _ast, string -> string end)


Why not render the original content instead? 🤔

We do when the original content is markdown. This was needed for cases where an AST node is created manually, like for type specs.
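To make the case of manually created nodes concrete, here is a minimal sketch of turning a small AST into a Markdown string. It is not the `to_markdown_string/2` implementation from this PR: the `MarkdownSketch` module is hypothetical, it assumes nodes shaped like `{tag, attrs, children, meta}` with binaries as text nodes, and it only covers a handful of tags.

```elixir
defmodule MarkdownSketch do
  @moduledoc false

  # Assumed node shape: {tag, attrs, children, meta}; text nodes are binaries.
  def to_markdown(nodes) when is_list(nodes), do: Enum.map_join(nodes, "", &to_markdown/1)
  def to_markdown(text) when is_binary(text), do: text
  def to_markdown({:p, _attrs, children, _meta}), do: to_markdown(children) <> "\n\n"
  def to_markdown({:code, _attrs, children, _meta}), do: "`" <> to_markdown(children) <> "`"

  def to_markdown({:a, attrs, children, _meta}) do
    "[" <> to_markdown(children) <> "](" <> Keyword.get(attrs, :href, "") <> ")"
  end

  # Unknown tags fall back to rendering only their children.
  def to_markdown({_tag, _attrs, children, _meta}), do: to_markdown(children)
end

# A hand-built node, similar in spirit to a spec with a linked type.
ast = [
  {:p, [], ["Returns ", {:a, [href: "String.md"], [{:code, [], ["String.t()"], %{}}], %{}}, "."],
   %{}}
]

IO.puts(MarkdownSketch.to_markdown(ast))
```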

I see. In this case maybe we should push the functionality to the retriever, so it adds specs both in text and html format.

Or maybe we use a separate function that knows how to render the specs for a given node with the given format.

What about the autolink functions? They depend on the parsed AST of the extra markdown docs and transform it.

Edit: ah I'm not currently using to_markdown_string to use that transformed AST for the guides, but would be good to do so IMO...

Do we want the markdown links in this case? Would the markdown links be useful for man pages? cc @eksperimental

Dunno. And on the LLM side not sure if links would make a difference, any idea @mjrusso?

I agree with @mjrusso that working hyperlinks between all .md files would be nice to have in general. For man pages, I'm not sure they're required, but they might be useful to be able to extract the links and do some ad-hoc processing. Man pages often refer to other man pages in the form "name(section)", especially under SEE ALSO at the bottom of a man page.

For the OTP man pages, it's probably more useful to refer to functions using Erlang syntax like keyfind/3 or lists:keyfind/3. But maybe for modules, we can refer to them using the man page syntax like maps(3erl).
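The two reference styles mentioned above can be sketched as plain string formatting. The `ManRefSketch` module and its functions are hypothetical, and the default section `"3erl"` is taken from the `maps(3erl)` example in this thread; Elixir module atoms would need extra handling, since they render with an `Elixir.` prefix.

```elixir
defmodule ManRefSketch do
  @moduledoc false

  # Module reference in man page syntax, e.g. maps(3erl).
  def module_ref(mod, section \\ "3erl") when is_atom(mod), do: "#{mod}(#{section})"

  # Function reference in Erlang syntax, e.g. lists:keyfind/3.
  def function_ref(mod, fun, arity) when is_atom(mod) and is_atom(fun) and is_integer(arity) do
    "#{mod}:#{fun}/#{arity}"
  end
end

IO.puts(ManRefSketch.module_ref(:maps))
IO.puts(ManRefSketch.function_ref(:lists, :keyfind, 3))
```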

josevalim · 2025-01-02T11:02:58Z

Thanks everyone for the work so far. I believe this is a great direction and, at the same time, it shows we need to do some cleanup before moving forward:

- We will want to refactor some of the template handling because there is too much shared logic between HTML, EPUB, Markdown. I think some of it should be moved to the retriever.
- I believe it should be multiple files indeed and no zips, since we can also use it to generate man pages and either. See WIP: Markdown formatter #1992, which is now a duplicate? /cc @eksperimental

mayel · 2025-01-02T13:54:10Z

We will want to refactor some of the template handling because there is too much shared logic between HTML, EPUB, Markdown. I think some of it should be moved to the retriever.

Is this about DRY and not having duplicate logic (which I tried to address by introducing the ExDoc.Formatter module), or about having more separation between the code for each format? Some guidance would be helpful here.

I believe it should be multiple files indeed and no zips, since we can also use it to generate man pages and either.

Yeah it's generating separate files now, following the same structure/naming as the html ones.

See WIP: Markdown formatter #1992, which is now a duplicate? /cc @eksperimental

Ah yeah seems so, I can look at that PR to see if there's any approach or piece of code (thinking especially of the templates) that looks better and port them to this one?

eksperimental · 2025-01-02T14:31:23Z

Hi everyone. I would like to discuss this in more detail, I think we could open up an issue so we don't divert the conversation from this PR? Doing the formatter I noticed the duplication but also the limitations of the current approach.

garazdawi · 2025-01-07T15:45:14Z

I created a gist with the markdown docs for the Erlang stdlib. I think the results look good, but as mentioned in other comments it could be nice to have links working.

I also think that specs/types/callbacks should be inside fenced erlang/elixir code blocks. That means that we need to remove the links, but as links are not working anyway it does not matter.

I also noticed that the equiv metadata is not rendered for Erlang docs.

I did a quick attempt at fixing markdown_to_man.escript and the generated output looks nice enough:

though one can probably spend an infinite amount of time fixing the many many small formatting issues that pop up in various places.

mayel · 2025-01-18T09:55:07Z

I also think that specs/types/callbacks should be inside fenced erlang/elixir code blocks. That means that we need to remove the links, but as links are not working anyway it does not matter.

how are links not working? and yeah it's either having links or formatting there, not sure which is preferable...

mayel · 2025-01-18T09:56:15Z

Hi everyone. I would like to discuss this in more detail, I think we could open up an issue so we don't divert the conversation from this PR? Doing the formatter I noticed the duplication but also the limitations of the current approach.

were you going to open an issue @eksperimental? otherwise not sure how to proceed here @josevalim?

josevalim · 2025-01-18T10:10:06Z

I should also say that we are adding search over HexDocs, which would allow you to search only certain packages for a given term, and submit the filtered results to an LLM. Would that be better than giving the whole docs of a bunch of deps? Which I assume would consume too many tokens?

mjrusso · 2025-01-18T12:53:38Z

I should also say that we are adding search over HexDocs, which would allow you to search only certain packages for a given term, and submit the filtered results to an LLM. Would that be better than giving the whole docs of a bunch of deps? Which I assume would consume too many tokens?

Yes, there is downstream work required to effectively use the documentation (at least for LLM consumption), which also happens to look a lot like a search problem.

This Livebook is a simple prototype; there's tons of opportunities for improvement but it does work and provides reasonable results. (This happens to use hex2txt to get the docs, but the nice part is that the approach is general and could work with any Markdown as input. I want a standalone app like this that I can run locally and that is exposed as a Model Context Protocol server, but that's getting off topic :)

garazdawi · 2025-01-20T10:53:53Z

how are links not working?

The links in specs are working, but the autolinks in the markdown documentation do not (that is, t:String.t/0 is not resolved to anything).

yeah it's either having links or formatting there, not sure which is preferable...

I'm going to guess that this depends on what it will be used for. For the use case that @zuiderkwast wants (that is, converting to man pages), the formatting is preferable as there are no links in man pages anyway. Either way it is easy enough for some postprocessing tool to strip links and re-format the specs.
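A minimal sketch of such a postprocessing step follows, assuming specs are emitted as inline Markdown links of the form `[text](url)`. The `MdPostprocess` module and its functions are hypothetical and not part of ExDoc; nested brackets and reference-style links are not handled.

```elixir
defmodule MdPostprocess do
  @moduledoc false

  # Replace inline Markdown links `[text](url)` with just `text`.
  # The `(?<!!)` lookbehind leaves image syntax `![alt](src)` untouched.
  def strip_links(text) do
    Regex.replace(~r/(?<!!)\[([^\]]+)\]\([^)]*\)/, text, "\\1")
  end

  # Wrap a link-free spec in a fenced code block for the given language.
  def fence_spec(spec, lang \\ "elixir") do
    fence = String.duplicate("`", 3)
    fence <> lang <> "\n" <> strip_links(spec) <> "\n" <> fence
  end
end

spec = "[String.t](String.md#t:t/0)() :: [binary](typespecs.md#basic-types)()"
IO.puts(MdPostprocess.fence_spec(spec))
```

Doing this as a separate pass would let the formatter keep links in specs by default, while consumers such as a man page converter opt into plain, fenced specs.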

josevalim reviewed Dec 9, 2024

lib/ex_doc/cli.ex (outdated)

josevalim reviewed Dec 9, 2024

lib/ex_doc/formatter/markdown.ex (outdated)

mayel added a commit to mayel/ex_doc that referenced this pull request Dec 26, 2024

WIP for elixir-lang#1976

28faff1

mayel added 4 commits December 27, 2024 14:13

add markdown formatter / exporter

8bd3cba

WIP for elixir-lang#1976

d295d8e

cleanup

23b5f01

update footer links & md tests

6bd15cc

mayel force-pushed the md-formatter branch from 230fbc0 to 6bd15cc Compare December 27, 2024 15:43

mayel added 2 commits December 27, 2024 16:37

add markdown rendering of AST

f2cfdc4

render spec links in md

7128381

mayel marked this pull request as ready for review December 27, 2024 16:55

mayel added 2 commits December 27, 2024 17:01

render img AST

857e316

fix warnings

eb3fbe9

josevalim reviewed Jan 2, 2025

mayel added 2 commits January 5, 2025 13:21

clean up

6c3a809

templates & rendering improvements, in part copied from elixir-lang#1992

f552c02

zuiderkwast mentioned this pull request Jan 7, 2025

Man pages missing for OTP 27.1.2 at erlang.org/downloads and github releases erlang/otp#9107

Closed
